The data is received from the American Community Survey. It consists of personal information about people residing in the United States of America between 2012 and 2016, an observational study collected over more than 10 million records.
Here is a glimpse/description of the data used to perform this analysis.
The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). The PUMS dataset includes variables for nearly every question on the survey; each record in the file represents a single person, and the PUMS covers approximately one percent of the United States population.
The data contains 7,487,361 rows and 234 columns; each row of this data set is one person's response.
Since the data is quite large, I have filtered it down to the 25 variables required for my analysis. The important variables I have chosen are as follows:
| Variable | Description | Data Type | Data Dictionary |
|---|---|---|---|
| ST | State code based on 2010 Census definitions | numerical, discrete / data as numerical | 01 - 72, allocated state names (01 - Alabama/AL, 02 - Alaska/AK, ...); consists of abbreviations. |
| VALP | Property value | numerical, discrete / data as numerical | bbbbbbb - N/A (GQ/vacant units, except "for-sale-only" and "sold, not occupied"/not owned or being bought); 0000001..9999999 - $1 to $9999999 (rounded and top-coded). |
| LAPTOP | Laptop or desktop | categorical / data as numerical | b - N/A (GQ/vacant); 1 - Yes; 2 - No. |
| HISPEED | Broadband (high speed) Internet service such as cable, fiber optic, or DSL service | categorical / data as numerical | b - N/A (GQ/vacant/no paid access to the internet); 1 - Yes; 2 - No. |
| FULP | Fuel cost (yearly cost for fuels other than gas and electricity; use ADJHSG to adjust values 3 and over to constant dollars) | numerical, discrete | bbbb - N/A (GQ/vacant); 0001 - included in rent or in condo fee; 0002 - no charge or these fuels not used; 0003..9999 - $3 to $9999 (rounded and top-coded). |
| GASP | Gas (monthly cost; use ADJHSG to adjust GASP values 4 and over to constant dollars) | numerical, discrete | bbb - N/A (GQ/vacant); 001 - included in rent or in condo fee; 002 - included in electricity payment; 003 - no charge or gas not used; 004..999 - $4 to $999 (rounded and top-coded). |
| TYPE | Type of unit | categorical / data as numerical | 1 - housing unit; 2 - institutional group quarters; 3 - noninstitutional group quarters. |
| TAXP | Property taxes (yearly amount; no adjustment factor is applied) | categorical / data as numerical | bb - N/A (GQ/vacant/not owned or being bought); 01 - none; 02 - $1 to $49; 03 - $50 to $99; and so on. |
| WIF | Workers in family during the past 12 months | categorical / data as numerical | b - N/A (GQ/vacant/non-family household); 0 - no workers; 1 - 1 worker; 2 - 2 workers; 3 - 3 or more workers. |
| HINCP | Household income (past 12 months; use ADJINC to adjust HINCP to constant dollars) | numerical, discrete / data as numerical | N/A (GQ/vacant); 00000000 - no household income; -0059999 - loss of $59,999 or more; -0059998..-0000001 - loss of $1 to $59,998; 00000001 - $1 or break even; 00000002..99999999 - total household income in dollars (components are rounded). |
| GRNTP | Gross rent (monthly amount; use ADJHSG to adjust GRNTP to constant dollars) | numerical, discrete / data as numerical | N/A (GQ/vacant/owned or being bought/occupied without rent payment); 00001..99999 - $1 to $99999 (components are rounded). |
| RNTP | Monthly rent (use ADJHSG to adjust RNTP to constant dollars) | numerical, discrete / data as numerical | N/A (GQ/vacant units, except "for rent" and "rented, not occupied"/owned or being bought/occupied without rent payment); 00001..99999 - $1 to $99999 (rounded and top-coded). |
| FINCP | Family income (past 12 months; use ADJINC to adjust FINCP to constant dollars) | numerical, discrete / data as numerical | N/A (GQ/vacant); 00000000 - no family income; -0059999 - loss of $59,999 or more; -0059998..-0000001 - loss of $1 to $59,998; 00000001 - $1 or break even; 00000002..99999999 - total family income in dollars (components are rounded). |
| RMSP | Number of rooms | numerical, discrete / data as numerical | N/A (GQ); 00..99 - rooms (top-coded). |
| NP | Number of persons associated with this housing record | numerical, discrete / data as numerical | 00 - vacant unit; 01 - one-person record (one person in household or any person in group quarters); 02..20 - number of person records (number of persons in household). |
| ADJINC | Adjustment factor for income and earnings dollar amounts (6 implied decimal places) | categorical / data as numerical | 1061971 - 2013 factor (1.007549 * 1.05401460); 1045195 - 2014 factor (1.008425 * 1.03646282); 1035988 - 2015 factor (1.001264 * 1.03468042); 1029257 - 2016 factor (1.007588 * 1.02150538); 1011189 - 2017 factor (1.011189 * 1.00000000). |
| ACR | Lot size | categorical / data as numerical | N/A (GQ/not a one-family house or mobile home); 1 - house on less than one acre; 2 - house on one to less than ten acres; 3 - house on ten or more acres. |
| INSP | Fire/hazard/flood insurance (yearly amount; use ADJHSG to adjust INSP to constant dollars) | numerical, discrete | N/A (GQ/vacant/not owned or being bought); 00000 - none; 00001..10000 - $1 to $10000 (rounded and top-coded). |
| FTAXP | Property taxes (yearly amount) allocation flag | numerical, discrete | N/A (GQ); No; Yes. |
| FHINCP | Household income (past 12 months) allocation flag | numerical, discrete | N/A (GQ); No; Yes. |
| FINSP | Fire/hazard/flood insurance (yearly amount) allocation flag | numerical, discrete | N/A (GQ); No; Yes. |
| YBL | When structure first built | numerical, discrete | N/A (GQ); 1939 or earlier; 1940 to 1949; 1950 to 1959; 1960 to 1969; 1970 to 1979; 1980 to 1989; 1990 to 1999; 2000 to 2004; 2005; 2006; 2007; 2008; 2009; 2010; 2011; 2012; 2013; 2014; 2015; 2016; 2017. |
| MV | When moved into this house or apartment | categorical / data as numerical | N/A (GQ); 12 months or less; 13 to 23 months; 2 to 4 years; 5 to 9 years; 10 to 19 years; 20 to 29 years; 30 years or more. |
| HHL | Household language | categorical | N/A (GQ); English only; Spanish; Other Indo-European languages; Asian and Pacific Island languages; Other language. |
| VEH | Vehicles (1 ton or less) available | categorical | N/A (GQ); 1 vehicle; 2 vehicles; 3 vehicles; 4 vehicles; 5 vehicles; 6 or more vehicles. |
| ST | VALP | LAPTOP | HISPEED |
|---|---|---|---|
| Min. : 1.00 | Min. : 100 | Min. :1.0 | Min. :1.0 |
| 1st Qu.:12.00 | 1st Qu.: 100000 | 1st Qu.:1.0 | 1st Qu.:1.0 |
| Median :27.00 | Median : 180000 | Median :1.0 | Median :1.0 |
| Mean :27.83 | Mean : 276505 | Mean :1.2 | Mean :1.2 |
| 3rd Qu.:42.00 | 3rd Qu.: 320000 | 3rd Qu.:1.0 | 3rd Qu.:1.0 |
| Max. :56.00 | Max. :6308000 | Max. :2.0 | Max. :2.0 |
| NA | NA’s :3139781 | NA’s :1355137 | NA’s :2625724 |
| FULP | GASP | TYPE | TAXP | WIF |
|---|---|---|---|---|
| Min. : 1.0 | Min. : 1.0 | Min. :1.00 | Min. : 1 | Min. :0 |
| 1st Qu.: 2.0 | 1st Qu.: 3.0 | 1st Qu.:1.00 | 1st Qu.:19 | 1st Qu.:1 |
| Median : 2.0 | Median : 20.0 | Median :1.00 | Median :31 | Median :2 |
| Mean : 115.6 | Mean : 46.3 | Mean :1.15 | Mean :34 | Mean :1 |
| 3rd Qu.: 2.0 | 3rd Qu.: 60.0 | 3rd Qu.:1.00 | 3rd Qu.:50 | 3rd Qu.:2 |
| Max. :7800.0 | Max. :640.0 | Max. :3.00 | Max. :68 | Max. :3 |
| NA’s :1355137 | NA’s :1355137 | NA | NA’s :3202733 | NA’s :3402406 |
| HINCP | GRNTP | RNTP | FINCP |
|---|---|---|---|
| Min. : -21500 | Min. : 4 | Min. : 4 | Min. : -21500 |
| 1st Qu.: 28600 | 1st Qu.: 670 | 1st Qu.: 520 | 1st Qu.: 39000 |
| Median : 57000 | Median : 940 | Median : 790 | Median : 70000 |
| Mean : 80345 | Mean :1071 | Mean : 921 | Mean : 94503 |
| 3rd Qu.: 100200 | 3rd Qu.:1336 | 3rd Qu.:1200 | 3rd Qu.: 116030 |
| Max. :3209000 | Max. :5022 | Max. :4000 | Max. :3164000 |
| NA’s :1355137 | NA’s :5761729 | NA’s :5660370 | NA’s :3402406 |
| RMSP | NP | ADJINC | ACR |
|---|---|---|---|
| Min. : 1 | Min. : 0.000 | Min. :1011189 | Min. :1.0 |
| 1st Qu.: 4 | 1st Qu.: 1.000 | 1st Qu.:1029257 | 1st Qu.:1.0 |
| Median : 6 | Median : 2.000 | Median :1035988 | Median :1.0 |
| Mean : 6 | Mean : 2.105 | Mean :1036534 | Mean :1.3 |
| 3rd Qu.: 7 | 3rd Qu.: 3.000 | 3rd Qu.:1045195 | 3rd Qu.:1.0 |
| Max. :30 | Max. :20.000 | Max. :1061971 | Max. :3.0 |
| NA’s :740715 | NA | NA | NA’s :2170040 |
| INSP | FTAXP | FHINCP | FINSP | YBL |
|---|---|---|---|---|
| Min. : 0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. : 1.0 |
| 1st Qu.: 450 | 1st Qu.:0.0 | 1st Qu.:0.0 | 1st Qu.:0.0 | 1st Qu.: 3.0 |
| Median : 800 | Median :0.0 | Median :0.0 | Median :0.0 | Median : 5.0 |
| Mean : 988 | Mean :0.1 | Mean :0.3 | Mean :0.1 | Mean : 5.2 |
| 3rd Qu.:1200 | 3rd Qu.:0.0 | 3rd Qu.:1.0 | 3rd Qu.:0.0 | 3rd Qu.: 7.0 |
| Max. :9400 | Max. :1.0 | Max. :1.0 | Max. :1.0 | Max. :21.0 |
| NA’s :3202733 | NA’s :740715 | NA’s :740715 | NA’s :740715 | NA’s :740715 |
| MV | HHL | VEH |
|---|---|---|
| Min. :1.0 | Min. :1.0 | Min. :0.0 |
| 1st Qu.:3.0 | 1st Qu.:1.0 | 1st Qu.:1.0 |
| Median :4.0 | Median :1.0 | Median :2.0 |
| Mean :4.2 | Mean :1.3 | Mean :1.8 |
| 3rd Qu.:6.0 | 3rd Qu.:1.0 | 3rd Qu.:2.0 |
| Max. :7.0 | Max. :5.0 | Max. :6.0 |
| NA’s :1355171 | NA’s :1355137 | NA’s :1355137 |
| Name | mainData |
|---|---|
| Number of rows | 7487361 |
| Number of columns | 25 |
| Column type frequency: numeric | 25 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ST | 0 | 1.00 | 27.83 | 15.91 | 1 | 12 | 27 | 42 | 56 | ▇▅▆▇▆ |
| VALP | 3139781 | 0.58 | 276504.72 | 383572.87 | 100 | 100000 | 180000 | 320000 | 6308000 | ▇▁▁▁▁ |
| LAPTOP | 1355137 | 0.82 | 1.21 | 0.41 | 1 | 1 | 1 | 1 | 2 | ▇▁▁▁▂ |
| HISPEED | 2625724 | 0.65 | 1.15 | 0.36 | 1 | 1 | 1 | 1 | 2 | ▇▁▁▁▂ |
| FULP | 1355137 | 0.82 | 115.57 | 498.35 | 1 | 2 | 2 | 2 | 7800 | ▇▁▁▁▁ |
| GASP | 1355137 | 0.82 | 46.32 | 71.80 | 1 | 3 | 20 | 60 | 640 | ▇▁▁▁▁ |
| TYPE | 0 | 1.00 | 1.15 | 0.48 | 1 | 1 | 1 | 1 | 3 | ▇▁▁▁▁ |
| TAXP | 3202733 | 0.57 | 33.89 | 19.96 | 1 | 19 | 31 | 50 | 68 | ▆▇▇▅▇ |
| WIF | 3402406 | 0.55 | 1.46 | 0.89 | 0 | 1 | 2 | 2 | 3 | ▃▆▁▇▂ |
| HINCP | 1355137 | 0.82 | 80345.21 | 88145.46 | -21500 | 28600 | 57000 | 100200 | 3209000 | ▇▁▁▁▁ |
| GRNTP | 5761729 | 0.23 | 1070.96 | 612.04 | 4 | 670 | 940 | 1336 | 5022 | ▇▅▁▁▁ |
| RNTP | 5660370 | 0.24 | 920.68 | 597.22 | 4 | 520 | 790 | 1200 | 4000 | ▇▆▁▁▁ |
| FINCP | 3402406 | 0.55 | 94503.35 | 95461.60 | -21500 | 39000 | 70000 | 116030 | 3164000 | ▇▁▁▁▁ |
| RMSP | 740715 | 0.90 | 5.99 | 2.43 | 1 | 4 | 6 | 7 | 30 | ▇▅▁▁▁ |
| NP | 0 | 1.00 | 2.10 | 1.50 | 0 | 1 | 2 | 3 | 20 | ▇▁▁▁▁ |
| ADJINC | 0 | 1.00 | 1036534.06 | 16851.15 | 1011189 | 1029257 | 1035988 | 1045195 | 1061971 | ▇▇▇▇▇ |
| ACR | 2170040 | 0.71 | 1.31 | 0.57 | 1 | 1 | 1 | 1 | 3 | ▇▁▂▁▁ |
| INSP | 3202733 | 0.57 | 987.69 | 976.79 | 0 | 450 | 800 | 1200 | 9400 | ▇▁▁▁▁ |
| FTAXP | 740715 | 0.90 | 0.09 | 0.29 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| FHINCP | 740715 | 0.90 | 0.30 | 0.46 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
| FINSP | 740715 | 0.90 | 0.14 | 0.34 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| YBL | 740715 | 0.90 | 5.24 | 3.19 | 1 | 3 | 5 | 7 | 21 | ▇▅▁▁▁ |
| MV | 1355171 | 0.82 | 4.23 | 1.85 | 1 | 3 | 4 | 6 | 7 | ▆▅▅▇▇ |
| HHL | 1355137 | 0.82 | 1.33 | 0.79 | 1 | 1 | 1 | 1 | 5 | ▇▁▁▁▁ |
| VEH | 1355137 | 0.82 | 1.84 | 1.09 | 0 | 1 | 2 | 2 | 6 | ▇▇▃▁▁ |
Graphical Summary of the data
Above are the tabular and histogram summaries of the subset I extracted from the full data; this is the data I will use to build the insightful graphics that follow.
The summaries clearly show that the data contains a large number of null values.
Note: I used the skimr package (skimr handles different data types and returns a skim_df object, which can be included in a tidyverse pipeline or displayed nicely for the human reader) together with the pander package to produce this readable summary of the data.
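A minimal sketch, on made-up values, of how skimr produces a summary like the one above, together with base-R equivalents of its n_missing and complete_rate columns (the toy frame below is illustrative; only the column names come from the dataset):

```r
# Toy stand-in for two columns of mainData (values are made up)
toy <- data.frame(ST   = c(1, 2, NA, 4),
                  VALP = c(100000, NA, NA, 320000))

if (requireNamespace("skimr", quietly = TRUE)) {
  # skim() returns a skim_df: one row per variable with
  # n_missing, complete_rate, mean, sd, quantiles, and a text histogram
  print(skimr::skim(toy))
}

# Base-R analogue of skimr's n_missing and complete_rate columns
n_missing     <- colSums(is.na(toy))
complete_rate <- colMeans(!is.na(toy))
```

The base-R lines reproduce exactly the first two numeric columns of the skim table.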
Insightful graphical and tabular summaries of the data
The knowledge you convey regarding the overall make-up of the data you chose to use
The tabular view above shows the first 10 rows of the data; the sky-blue rectangles indicate cells with no value present.
This data is rooted in the American Community Survey, which collects data on an ongoing basis, January through December, to provide every community with the information it needs to make important decisions. New data are released every year, in the form of estimates, in a variety of tables, tools, and analytical reports.
After going through all the major parts of this data, I decided to work on the Housing data (the American Community Survey publishes both Housing and Population data). I may well be the only one working on the Housing data; the major reason is that I want to understand more deeply the housing landscape of the United States.
This housing data provides good insight into the demographics of the United States of America. Through it you will learn about both the data profiles and the narrative profiles of housing particulars and facts.
Once I decided to work on the housing data, I started mining for the columns I wanted to work with. After spending a couple of days scrutinizing the columns, I decided to work on
“ST”, “VALP”,“LAPTOP”,“HISPEED”,“FULP”,“GASP”,“TYPE”,“TAXP”,“WIF”,“HINCP”,“GRNTP”,“RNTP”,“FINCP”,“RMSP”,“NP”,“ADJINC”,“ACR”,“INSP”,“FTAXP”,“FHINCP”,“FINSP”,“YBL”,“MV”,“HHL”,“VEH”.
These columns cover almost all the useful facets of this data, and they reflect the economic, social, and demographic circumstances of the people of the United States.
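The column filtering can be sketched as follows; the raw file name is hypothetical, and only the 25 column names come from the list above:

```r
# The 25 analysis columns named in the text
keep <- c("ST", "VALP", "LAPTOP", "HISPEED", "FULP", "GASP", "TYPE", "TAXP",
          "WIF", "HINCP", "GRNTP", "RNTP", "FINCP", "RMSP", "NP", "ADJINC",
          "ACR", "INSP", "FTAXP", "FHINCP", "FINSP", "YBL", "MV", "HHL", "VEH")

# Hypothetical read of the PUMS housing file, then keep only those columns.
# mainData <- read.csv("pums_housing.csv")
# mainData <- mainData[, keep]        # dplyr equivalent: select(mainData, all_of(keep))
```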
Meaningful pre-processing : Which columns contain the most unwanted and unusual values
When working with huge data, it is one's responsibility to filter the junk information out of the data. Doing so not only cleans the data but also yields more precise (accurate) results, and it helps reduce false positives and false negatives in the analysis.
I created a circular bar plot (one of a kind) that shows which columns contain the most unusable information: the bigger the bar, the more null values and outliers the column contains.
I explicitly handled those null values and outliers during processing, which produced much better visual results. Three columns (TYPE, NP, and ADJINC) are missing from this plot because they contain no null values; I checked them explicitly, and they carry important constraints that must be kept.
Meaningful pre-processing : Which columns contain the most NULL values (NA) - a dynamic representation of percentages (plotly)
This is a dynamic plot: hovering over a column gives a clear description of the data, and you can filter the TRUE and FALSE values by clicking the TRUE/FALSE labels on the right. You can also save the image with a single click (no code needed), and you can zoom in and out of the graph.
Null values are the biggest threat to any analysis. Results often deviate from accuracy simply because of the presence of NULL values.
Before we apply any algorithm to our data, the data should be tidy and structured. But in the real world, the data we initially see is mostly unstructured, so to make it tidy, and to then apply any algorithm to derive insights, the data has to be cleaned. The major reason data is not tidy is the presence of missing values and outliers.
In the analysis above, I created a dynamic bar plot that gives an in-depth description of the null-value percentage in each column: it shows clearly what percentage of each column holds actual values versus NULL values.
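The quantity this bar plot encodes, the per-column NA percentage, is a one-liner in R; the toy values below are illustrative, and the plot_ly call is a sketch rather than the exact code used:

```r
# Per-column NA percentage on a toy frame (values are made up)
toy <- data.frame(VALP = c(100000, NA, NA, 320000),
                  TYPE = c(1, 1, 2, 1))
na_pct <- 100 * colMeans(is.na(toy))   # VALP: 50, TYPE: 0

# Hypothetical plotly sketch of the dynamic bar chart:
# plotly::plot_ly(x = names(na_pct), y = na_pct, type = "bar")
```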
To perform this analysis I took the housing data and covered all the major columns that give precise, detailed information about housing patterns across the United States of America. Since the data includes property information, resource consumption, and expenditure information, I focused explicitly on analyses that yield information about housing. I also covered the variability of each economic variable and shed light on its usage and its influence on other important variables.
The data also comes with demographic information, which helps predict future outcomes in terms of transportation and land prices; I have also discussed the amenities that people in the United States currently use.
Since the data includes state information, it was obligatory to discuss patterns across the different states. I compared the states on various statistics, which produced some jaw-dropping results I was not initially expecting, yet they hold up.
While following the methodology, the following questions arose:
How did you deal with missing values? What impact does your approach have on the interpretation or generalizability of the resulting analysis?
Honestly speaking, in total this data contains more ‘NA’s than actual information. Almost every column contains ‘NA’ values, which is to be expected in survey data, where every piece of information is not always available.
To overcome this problem, I removed the missing values (NA) from the variables involved in each analysis and stored the results in individual data frames. The dplyr library helps here to filter out the data that is not useful for our purposes.
Removing the NAs produces clean, tidy, and more precise results, and it increases the usefulness of a column. I compared results with and without NAs: NAs always pull the result away from the actual outcome, increasing the chance of what are called false positives.
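A minimal base-R sketch of this NA removal on a toy frame; the dplyr form `filter(toy, !is.na(VALP))` is equivalent:

```r
# Toy frame with one missing property value
toy <- data.frame(ST = c(1, 1, 2), VALP = c(100000, NA, 200000))

# Keep only rows where VALP is observed, stored in its own data frame
value_data <- toy[!is.na(toy$VALP), ]
```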
How did you deal with outliers?
How did you deal with weights & income adjustment values?
There is no standard way of addressing weights. While many R functions have a weights parameter, there is no consistency in how they are interpreted; most commonly, weights in R are interpreted as frequency weights. I handled this by treating every record equally and using the plain (standard) mean of the data; note, however, that ignoring survey weights mainly affects the standard errors.
The income values need adjustment, as is clearly mentioned in the data dictionary: dollar amounts should be adjusted with ‘ADJHSG’ (housing costs) or ‘ADJINC’ (income).
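Per the data dictionary, ADJINC carries six implied decimal places, so it is divided by 1e6 before multiplying. A minimal sketch (the income values are illustrative; the factors are the 2015 and 2016 entries from the dictionary above):

```r
hincp  <- c(57000, 100200)      # sample household incomes (illustrative)
adjinc <- c(1035988, 1029257)   # 2015 and 2016 ADJINC factors from the dictionary

# Divide by 1e6 to recover the real-valued factor, then adjust to constant dollars
hincp_constant <- hincp * (adjinc / 1e6)
```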
Did you produce any tables or plots that you thought would reveal interesting trends but didn’t?
1. To give a state-wise plot of family members, I tried to create a circular bar plot, which gives a good view and depicts results in a fancy way, but I was not able to compute the angles needed to divide the circular plot into clusters.
2. I also tried to make a dense heat map showing the relation between property value and the amenities present in a house; the darker sections would have shown that the houses with the most amenities hold the highest property values, but due to factor-plotting issues I was not able to create it.
What’s the analysis that you finally settled on? What relationships do you investigate in the final analysis?
In the findings section I have analysed the major relationships between the economic variables/columns in the data, which speak to the economic circumstances of particular demographics. I have also made state-wise comparisons, which differentiate the coherence and distinctiveness of each state.
Along with that, I have tested a fact that is presented as TRUE according to statistics on the internet, and confirmed whether it actually holds.
I have also predicted values using predictive analysis - linear regression and a chi-square test - and performed prediction of house values using a Support Vector Machine (introducing the caret library).
I have also introduced dynamic graphs using plotly, where we can filter the graph based on a condition and check the value of each section along with its plotting attributes.
Following are the findings I came up with (tabular summaries, graphical summaries, all the predictive outputs, and statistical significance):
## # A tibble: 51 x 2
## ST Avg_Value
## <int> <dbl>
## 1 1 165098.
## 2 2 236745.
## 3 4 241961.
## 4 5 142313.
## 5 6 584500.
## 6 8 342265.
## 7 9 397463.
## 8 10 286793.
## 9 11 666619.
## 10 12 263859.
## # … with 41 more rows
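A tibble like the one above comes from a group-and-summarise over states. A base-R sketch on toy values (the numbers below are illustrative, not the real state averages):

```r
# Toy frame: two records for state 1, one for state 2
toy <- data.frame(ST = c(1, 1, 2), VALP = c(150000, 180000, 236745))

# Mean property value per state
avg_value <- aggregate(VALP ~ ST, data = toy, FUN = mean)

# dplyr equivalent used in a tidyverse pipeline:
# toy %>% group_by(ST) %>% summarise(Avg_Value = mean(VALP, na.rm = TRUE))
```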
Working on the housing data should definitely shed light on property values.
Fact worth Noting - If you add the value of all the homes in the United States together, you get a sum that’s a lot to get your mind around: $31.8 trillion. It’s more than 1.5 times the Gross Domestic Product of the United States and approaching three times that of China.
In this analysis I am examining the states with higher property values. I had some prior expectations about which states would top this analysis (of course New York should rank high; if not, something is wrong, because New York is considered one of the most expensive places on earth).
A state-wise map gives a good visualization here. Note that I am calculating the average value; darker regions on the map mean higher values.
Observation comes Out
Bingo !!
My expectation was right: from this map we can easily see that the average house value in New York is much higher than in the other states. The clustering of values also helps us pick out the extremes.
We all know that the United States is a leader both in creating technology and in using it.
It is exciting to check whether the people of the United States have good, high-speed internet service, or whether they are actually struggling.
Being a first-world country, I am assuming that more than 95% of the population should have excellent internet service.
Observation comes Out
I plotted a pie chart that proves my assumption almost wrong: about 87% of the United States population enjoys high-speed internet service, while more than 10% of the population is still struggling to get it.
This is 2019; the US should do something for these people, and I hope it actually starts addressing this issue.
Fuel and gas consumption is one of the biggest threats to natural resources, which are being depleted day by day. It is also well known that the United States is among the countries with the highest consumption of natural resources. To check this, I performed a small analysis.
To examine the statement above, I plotted a bar chart showing fuel and gas costs by state in the USA. This eventually helps us ask why the figure is so high in specific states.
Here I am assuming that the state with the highest fuel consumption is also the state with the most automobiles and factories; however, this graph alone is not concrete evidence that the automobile count is high in a specific state.
Depending on where you live, property taxes can be a small inconvenience or a major burden.
Fact worth Noting - more than $14 billion in property taxes go unpaid each year in the United States.
Since I am working on household data, it is essential to find out how much tax (specifically property tax) a citizen of the United States has to add to his or her expenditure list every year. It is worth noting that property tax depends heavily on the location of the house as well as the state you live in.
I expect that the amenities present in the house and the size of the house do not play a vital role in calculating property tax.
Based on this link: https://www.businessinsider.com/average-property-taxes-every-us-state#51-alabama-1 - Alabama has the lowest property taxes in the US. Now is the time to check that.
Observation comes Out
States near New York, and others such as California and Illinois, to name a few, have somewhat higher average property-tax costs compared to the other states in the USA.
And yes, that link was right: after plotting this map, we can be confident that Alabama has the lowest property taxes in the US.
## WIF HINCP GRNTP RNTP FINCP
## WIF 1.00 0.33 0.22 0.20 0.35
## HINCP 0.33 1.00 0.49 0.48 0.98
## GRNTP 0.22 0.49 1.00 0.98 0.48
## RNTP 0.20 0.48 0.98 1.00 0.48
## FINCP 0.35 0.98 0.48 0.48 1.00
##
## n= 893525
##
##
## P
## WIF HINCP GRNTP RNTP FINCP
## WIF 0 0 0 0
## HINCP 0 0 0 0
## GRNTP 0 0 0 0
## RNTP 0 0 0 0
## FINCP 0 0 0 0
The correlation coefficient is a statistical measure that calculates the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no relationship between the movement of the two variables.
The equation for the correlation is
\(\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}\)
WIF - Workers in family during the past 12 months
HINCP - Household income
GRNTP - Gross rent
RNTP - Monthly rent
FINCP - Family income
The columns above show the true colors of the economic condition of a family and a house. To check whether they are actually correlated with each other, we performed a correlation check between these columns.
Observation comes Out from this analysis
I explicitly plotted the ellipses, which show the type of correlation. The ellipses and the figures confidently state that there is a significant correlation among these variables, and that it is a positive correlation.
We can also specify the significance level using the parameter sig.level = .01 in order to discard non-significant correlations.
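A sketch of how such a correlation/ellipse plot can be produced. The toy data below merely mimics the strong HINCP-FINCP relationship seen in the matrix above; the values are simulated, not from the survey:

```r
set.seed(1)
x <- rnorm(200)
toy <- data.frame(HINCP = x,
                  FINCP = x + rnorm(200, sd = 0.1),  # near-duplicate, like the 0.98 above
                  RNTP  = 0.5 * x + rnorm(200))      # moderately related

cm <- cor(toy)   # pairwise Pearson correlation matrix

if (requireNamespace("corrplot", quietly = TRUE)) {
  # The ellipse shape encodes sign and strength of each correlation
  corrplot::corrplot(cm, method = "ellipse")
}
```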
##
## F test to compare two variances
##
## data: predict.data$RMSP and predict.data$VALP
## F = 0.000000000036489, num df = 4347579, denom df = 4347579,
## p-value < 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.00000000003644095 0.00000000003653797
## sample estimates:
## ratio of variances
## 0.00000000003648941
## [1] 0.2500915
## [1] 1.367129
## [1] 6.586461
##
## Call:
## lm(formula = valueofhouse ~ RMSP, data = predict.data.test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389784 -129011 -31586 50375 632753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 422720 93541 4.519 0.00013 ***
## RMSP -18546 5768 -3.215 0.00358 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239500 on 25 degrees of freedom
## Multiple R-squared: 0.2925, Adjusted R-squared: 0.2642
## F-statistic: 10.34 on 1 and 25 DF, p-value: 0.003579
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 422720.22 | 93541.050 | 4.519088 | 0.0001296 |
| RMSP | -18545.58 | 5767.988 | -3.215260 | 0.0035795 |
## [1] 134.3137
Linear regression describes the relationship between a response variable (or dependent variable) of interest and one or more predictor (or independent) variables.
The simple linear regression is used to predict a quantitative outcome y on the basis of one single predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a function of the x variable.
Once, we built a statistically significant model, it’s possible to use it for predicting future outcome on the basis of new x values.
Formula and basics The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where:
b0 and b1 are known as the regression beta coefficients or parameters: +b0 is the intercept of the regression line; that is the predicted value when x = 0. +b1 is the slope of the regression line.
e is the error term (also known as the residual error), the part of y that cannot be explained by the regression model
The sum of the squares of the residual errors are called the Residual Sum of Squares or RSS.
Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called least squares regression or ordinary least squares (OLS) regression.
Correlation Coefficient and Variance Equality The correlation coefficient measures the level of the association between two variables x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).
Compute the correlation coefficient between the two variables using the R function cor(). In our case the correlation coefficient is 0.2500915, which suggests the correlation is weak but not negligible (a value between -0.15 and 0.15 would indicate essentially no correlation).
The F test above compares the two variances; the tiny ratio of variances and the p-value below machine precision show that the variances of RMSP and VALP are far from equal, i.e. the two variables live on very different scales.
Computation
The linear model equation can be written as follow: VALP = b0 + b1 * RMSP
Interpretation
+the estimated regression line equation can be written as follows: VALP = 422720 + (-18546) * RMSP
+the intercept (b0) is 422720. It can be interpreted as the predicted property value (VALP) when RMSP, the number of rooms, is zero.
+the regression beta coefficient for the RMSP variable (b1), also known as the slope, is -18546.
Coefficients significance
t-statistic and p-values:
For a given predictor, the t-statistic (and its associated p-value) tests whether or not there is a statistically significant relationship between a given predictor and the outcome variable, that is whether or not the beta coefficient of the predictor is significantly different from zero.
The statistical hypotheses are as follow:
Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship between x and y)
Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is some relationship between x and y) Mathematically, for a given beta coefficient (b), the t-test is computed as t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b. The t-statistic measures the number of standard deviations that b is away from 0. Thus a large t-statistic will produce a small p-value.
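The relationship t = (b - 0)/SE(b) can be verified directly from any fitted model; this sketch uses R's built-in cars data as a stand-in for the VALP ~ RMSP model:

```r
fit   <- lm(dist ~ speed, data = cars)  # built-in data set standing in for VALP ~ RMSP
coefs <- summary(fit)$coefficients      # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Recompute the t-statistic by hand: estimate divided by its standard error
t_manual <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
```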
The higher the t-statistic (and the lower the p-value), the more significant the predictor. The symbols to the right visually indicate the level of significance; the line below the table defines these symbols (one star means 0.01 < p < 0.05), and the more stars beside a variable's p-value, the more significant the variable.
A statistically significant coefficient indicates that there is an association between the predictor (x) and the outcome (y) variable. In our case the p-value is 0.00358, which shows that RMSP is a useful predictor of VALP. The t-statistic also deviates clearly from 0, so the beta coefficient of the predictor is significantly different from zero.
Model accuracy
Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as goodness-of-fit.
The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
| rse | r.squared | f.statistic | p.value |
|---|---|---|---|
| 239500 | 0.2642 | 10.34 | 0.003579 |
Residual standard error (RSE). The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observations points around the fitted regression line. This is the standard deviation of residual errors.
RSE provides an absolute measure of the variation in the data that cannot be explained by the model. When comparing two models, the one with the smaller RSE fits the data better.
Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.
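Sketched on the same kind of toy fit (sigma() returns the RSE of an lm object):

```r
# Prediction error rate = RSE / mean(outcome), on hypothetical data
x <- c(3, 4, 5, 6, 7, 8)
y <- c(150, 185, 205, 262, 295, 335)
fit <- lm(y ~ x)

rse        <- sigma(fit)        # residual standard error of the model
error_rate <- rse / mean(y)     # lower is better
```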
In our example, RSE = 239500, meaning that the observed VALP values deviate from the true regression line by approximately 239500 units on average.
R-squared and Adjusted R-squared: The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.
The R2 measures how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient.
F-Statistic: The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient. In simple regression the F-statistic is the square of the t-statistic: 10.34 = (-3.215)^2.
In a simple linear regression this test is not very interesting, since it just duplicates the information given by the t-test in the coefficient table. A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 10.34, producing a p-value of 0.003579, which is highly significant.
## [1] -0.002725119
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables.
The correlation coefficient of two variables in a data set equals their covariance divided by the product of their individual standard deviations. It is a normalized measure of how linearly related the two are.
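This definition is easy to verify directly in R (toy vectors, not the survey data):

```r
# Correlation as covariance normalized by the standard deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))   # TRUE
```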
Here we check whether a family's household income actually correlates with their living style.
It is quite interesting to check whether people in the United States with a higher household income tend to live in extra-spacious homes (big houses).
Observations from this analysis
Before the statistical calculations we first removed the outliers and the null values from the data, which yields better estimates.
To check the correlation we used the Pearson correlation, computed here on the population data (the population correlation coefficient).
Applying the Pearson correlation gives a value of -0.002725119, which indicates essentially no linear relationship between the variables.
The graph likewise shows no clear relation between household income and house size.
It is also interesting to check whether people in the United States actually worry about their expenditure on house-related issues such as home insurance. Since the data contains both the type of property and the house insurance cost, we can easily check whether people actually spend on caring for their houses or treat it as a secondary expenditure.
Fact worth noting: the average homeowners insurance premium rose by 3.6 percent in 2015, following a 3.3 percent increase in 2014, according to a January 2018 study by the National Association of Insurance Commissioners. The average renters insurance premium fell 1.1 percent in 2015 after rising 1.1 percent in 2014.
Observations from this analysis
Americans usually do not spend much money on house insurance; they do not even invest a quarter of their income on it. The reason may be the routine mortgage payments, which are burgeoning day by day.
We have also checked the top five richest states of the United States, and there is no significant difference for those states either.
We can therefore say that the observation about the low level of investment in house insurance is concrete and well supported.
It has been noticed that the median value of a house in 1950 was around 7,400 USD, but the shocking part is that the median value of a home in the USA now (2017) is around 221,800 USD, roughly 30 times the 1950 value.
This clearly shows that the construction year also matters a lot. However, size and location also play a vital role when it comes to buying a home in the United States.
Since our data contains both the value of the home and the construction year, we can also test the above statement and check whether the construction year actually matters when it comes to price.
Observations from this analysis
The graph clearly shows that the value of a house also depends on its year of construction. From the diagram we can say that the average price of houses built between 1940 and 1990 was around 200,000 USD, which has drastically increased to an average of 310,000 USD in the last decade. It is also clear that houses built in 2017 have the highest prices. So Americans should be ready to loosen their pockets when they even consider buying a home.
America is known as a home for people of many cultures, and people of different cultures often speak different languages (not at their workplaces, but at home or among people of the same background).
Most people living in the United States use English as their language at home. Here we try to find out which other languages people commonly use at home, which eventually tells us something about the demographics.
Observations from this analysis
From this analysis it is clearly shown that most people who have lived here for the past 20 years or more use English as their primary language at home. Another point that emerges is that the USA has a large population of Spanish speakers, the largest group after English speakers.
## [1] FALSE
## FINCP VALP
## Min. : 0 Min. : 1700
## 1st Qu.: 43600 1st Qu.: 70000
## Median : 62780 Median : 122500
## Mean : 84840 Mean : 173947
## 3rd Qu.:106138 3rd Qu.: 215500
## Max. :444000 Max. :1843000
## Support Vector Machines with Linear Kernel
##
## 72 samples
## 1 predictor
## 42 classes: '1700', '12000', '13000', '20000', '25000', '30000', '35000', '40000', '50000', '60000', '65000', '70000', '75000', '80000', '84000', '90000', '96000', '100000', '103000', '115000', '120000', '125000', '130000', '135000', '140000', '150000', '160000', '180000', '190000', '200000', '215000', '217000', '220000', '225000', '250000', '300000', '350000', '375000', '400000', '500000', '750000', '829000'
##
## Pre-processing: centered (1), scaled (1)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 66, 63, 67, 66, 61, 64, ...
## Resampling results:
##
## Accuracy Kappa
## 0.03821225 0.02033985
##
## Tuning parameter 'C' was held constant at a value of 1
## [1] 150000 125000 125000 125000 125000 250000 125000 125000 40000 125000
## [11] 40000 125000 40000 125000 125000 125000 150000 125000 125000 125000
## [21] 150000 40000 125000 125000 125000 125000 250000 40000
## 42 Levels: 1700 12000 13000 20000 25000 30000 35000 40000 50000 ... 829000
SVM (Support Vector Machine) is a supervised machine learning algorithm which is mainly used to classify data into different classes. Unlike most algorithms, SVM makes use of a hyperplane which acts as a decision boundary between the various classes. How does it work on our data? It draws a decision boundary, i.e. a hyperplane, between any two classes in order to separate or classify them. The basic principle behind SVM is to draw the hyperplane that best separates the two classes.
Usage
To implement the Support Vector Machine (SVM) we need to install the package called caret. The caret package, short for Classification And REgression Training, has tons of functions that help build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, unsupervised learning algorithms, etc.
For usage I divide my data into train and test sets, and I convert the target variable into a factor so that accuracy can be computed from the predictions. I have also declared the train control method explicitly, since the computational power of my machine is limited.
To perform the SVM prediction I took a sample of records, since computation on the full data takes a lot of memory and needs strong, robust processing power. The resampling method used is "repeatedcv", with 10 folds.
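The preparation steps above can be sketched in base R on a synthetic stand-in for the sample (the column names and values below are illustrative, not the PUMS data); with caret installed, `trainControl(method = "repeatedcv", number = 10, repeats = 3)` and `train(VALP ~ FINCP, data = train, method = "svmLinear", preProcess = c("center", "scale"), trControl = ctrl)` then reproduce the kind of model summary shown above:

```r
set.seed(42)
# Synthetic stand-in for the sampled housing data (illustrative only)
sample.data <- data.frame(
  FINCP = runif(120, 2e4, 2e5),
  VALP  = sample(c(40000, 125000, 250000), 120, replace = TRUE)
)

# 70/30 train-test split
idx   <- sample(seq_len(nrow(sample.data)), size = floor(0.7 * nrow(sample.data)))
train <- sample.data[idx, ]
test  <- sample.data[-idx, ]

# Convert the target to a factor so the task is treated as classification
train$VALP <- factor(train$VALP)
```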
The SVM predicts the testing data with an accuracy of about 4% (starting from 1.88%), computed on the sample. This can certainly be improved (pushing the accuracy beyond 4.2% from the initial 1.88%) by retraining the model repeatedly on robust computing infrastructure.
Please note: this graph is created with the plotly package and is fully dynamic. Hover your mouse over the graph to see the statistics for each bar. You can also save the chart with one click, zoom in and out within the graph, and filter the data by clicking on the WIF section of the legend.
Since the household data contains information on the number of workers in each house, it is interesting to find out which states have the maximum number of one-worker households and which states have more than two workers per house.
This finding also helps relate household earnings to the number of workers in a house.
Observations from this analysis
From the above dynamic graph it is clearly shown that:
##
## Pearson's Chi-squared test
##
## data: tbltest
## X-squared = 912600, df = 18, p-value < 0.00000000000000022
Performing the Chi-Square test here: the Chi-Square test is a statistical method to determine whether two categorical variables have a significant association between them. Both variables should come from the same population and both should be categorical.
In this test we particularly look at the p-value. Moreover, like all statistical tests, this one has a null hypothesis and an alternative hypothesis.
We reject the null hypothesis if the resulting p-value is less than a predetermined significance level, usually 0.05. In our case it is far below the significance level, so we can easily reject the null hypothesis.
We create a contingency table to perform the test.
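A minimal sketch with hypothetical counts (the real test uses the WIF-by-VEH contingency table built from the survey data):

```r
# Toy contingency table: workers-in-family (rows) vs vehicles (columns)
tbl <- matrix(c(120,  80,  30,
                 60, 150,  90,
                 20,  70, 140),
              nrow = 3, byrow = TRUE,
              dimnames = list(WIF = c("0-1", "2", "3+"),
                              VEH = c("0-1", "2", "3+")))

res <- chisq.test(tbl)
res$p.value < 0.05   # TRUE here: reject the null of independence
```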
Observations from this analysis
We have a high chi-squared value and a p-value below the 0.05 significance level, so we reject the null hypothesis and conclude that WIF and VEH have a significant relationship. This suggests that families with more workers actually need more cars as a source of transportation (although this can also depend on household income).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 670 940 1071 1336 5022
##
## One-sample Kolmogorov-Smirnov test
##
## data: gross.rent$GRNTP
## D = 0.11121, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
##
## One Sample t-test
##
## data: gross.rent$GRNTP
## t = 126.56, df = 1725631, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 1012
## 95 percent confidence interval:
## 1070.051 1071.877
## sample estimates:
## mean of x
## 1070.964
In this analysis we compare the mean gross rent in our data with the reported gross rent in the United States these days. To do this we use a one-sample t-test.
Fact noted: from https://www.deptofnumbers.com/rent/us/ the median gross rent in the United States is $1012.
So we can check whether this figure holds in our data, taking the mu value to be the median gross rent of the United States of America.
What is One Sample T-test
In simple words, a one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ).
Assumption for performing the t-test: the data must be normally distributed. We test normality here with the Kolmogorov-Smirnov test rather than the Shapiro-Wilk test, because our sample size is very large and the Shapiro-Wilk test accepts at most 5000 values.
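Both tests can be sketched on simulated rent-like data (the real calls operate on gross.rent$GRNTP, and 1012 is the reported national median):

```r
set.seed(1)
# Simulated stand-in for GRNTP (gross monthly rent), truncated at zero
rent <- rnorm(5e4, mean = 1071, sd = 600)
rent <- rent[rent > 0]

# Normality check: one-sample Kolmogorov-Smirnov test against a fitted normal
ks.test(rent, "pnorm", mean = mean(rent), sd = sd(rent))

# One-sample t-test of the mean against mu = 1012
t.test(rent, mu = 1012)
```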
Summary of the data
T-test Results
From the results we get a t-statistic of 126.56 and a p-value below 2.2e-16, with a 95 percent confidence interval of (1070.051, 1071.877) around the sample mean of 1070.964.
Interpretation of the result
The p-value of the test is below 2.2e-16, which is less than the significance level alpha = 0.05. We can conclude that the mean gross rent in our data is significantly different from the reported median gross rent of the United States.
Basics of Neural Network
A neural network is a model characterized by an activation function, which is used by interconnected information-processing units to transform input into output. A neural network has often been compared to the human nervous system: information is passed through interconnected units analogous to the passage of information through neurons in humans.
The first layer of the neural network receives the raw input, processes it and passes the processed information to the hidden layers.
The hidden layer passes the information to the last layer, which produces the output. The advantage of neural network is that it is adaptive in nature. It learns from the information provided, i.e. trains itself from the data, which has a known outcome and optimizes its weights for a better prediction in situations with unknown outcome.
A perceptron, viz. single layer neural network, is the most basic form of a neural network. A perceptron receives multidimensional input and processes it using a weighted summation and an activation function.
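The perceptron's forward pass is exactly this weighted summation followed by an activation; a minimal sketch with made-up weights:

```r
# Sigmoid activation squashes the weighted sum into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# One processing unit: weighted sum of inputs plus bias, then activation
perceptron <- function(x, w, b) sigmoid(sum(w * x) + b)

x <- c(0.5, -1.2, 3.0)    # one multidimensional input
w <- c(0.4,  0.1, -0.2)   # weights (illustrative)
b <- 0.1                  # bias
perceptron(x, w, b)       # a value strictly between 0 and 1
```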
Our case: the objective is to predict VALP from variables such as INSP, FINCP, FULP, and GASP.
This tells us whether these variables are enough to predict the value of a house, which is quite helpful for a newcomer to an area. It can eventually be used to estimate VALP and to assist town planning as well, since it reflects the proximity of resources nearby.
With an error rate of approximately 1.84, we can say that some of these variables are very helpful for predicting the value of the house, while others introduce randomness into the data and act like outliers here.
After completing these various analyses, I have arrived at several results that clearly describe the housing pattern in the United States of America and show how each variable contributes to the economy. Through this scrutiny, I got to know more about the demographics of the United States of America.
Following are the conclusions that emerge from this analysis:
From the state-wise analysis, I confirm that New York is the state with the maximum land value. Initially I had only heard about this, but now I have concrete evidence for the fact. New York, followed by California, is the most expensive state in terms of land value, which is worth noting for anyone working on the economic sector of the United States.
From the analysis I realize that many people in the United States are still craving good internet speed, which is worth noting for the telecommunications sector. Also, natural-resource consumption varies from state to state; while not highly influential, we now have a rough idea of it.
Since this is housing data, there is a clear opportunity to study house-related tax. From this analysis I conclude that the state with the maximum land value is also the state with the highest house tax. I was a bit surprised to learn that Louisiana costs less in terms of housing tax (with 4.66 million residents).
Using the predictive analysis I confirmed that the economic variables in the data are closely tied to one another (by computing correlations): an impact on one can mirror an impact on another, which was not obvious beforehand. In the predictive section I tested whether the number of rooms actually affects the house value. Confirming the relation, I can say that it matters in many scenarios: getting an extra room in your house in the United States of America can loosen up your pocket. I also found that household income does not determine the area of the house a family lives in.
From the analysis I found that Americans do not care much about their houses, investing barely a tenth of their salary on house insurance. In fact I checked this for five major states, including New York, but did not get much more fruitful results, so not even New York is an exception.
I also learned that the year of construction affects the value of a house, which is not so obvious in every country. Houses constructed after 2015 have the highest values of the decade. Also, the United States has a large number of Spanish speakers: Spanish-speaking households are the second-largest language group in the United States after English speakers.
After the analysis, I was fascinated to learn that many houses in the United States of America have zero workers at home, though there is some chance of a discrepancy here, because surviving in a country like the United States without earning is quite difficult (almost impossible).
Using the Chi-Square test, I found that households with the largest families also own the most cars, which tells us that people in the United States are concerned about transportation, if not about other living costs.
Are the models you fit believable?
I believe that the models I fit are the right models to work with; my only concern is training them recursively so that they reach acceptable accuracy (which they sometimes lack). I am also somewhat constrained by the machine, because most of the model training is done on cloud-based infrastructure.
How much confidence do you have in your analysis? Do you believe your conclusions? Are you confident enough in your analysis and findings to present them to policy makers?
I am around 80%-85% confident and sure about the conclusions I have drawn, and there is always room for improvement. The conclusions are useful for policy simulation and policy creation in a country. I fully believe in the effort I have put in and believe it will be helpful to a viewer. I am confident about my outputs and results here, although I need to work more on the data-plotting part.
Code-1 : Reading the data here inside this chunk
# reading the data set only
library(data.table)
library(pander)
library(kableExtra)
library(skimr)
# start_time <- Sys.time()
# mainData <- fread("MainData.csv")
# end_time <- Sys.time()
#end_time - start_time
options(scipen=10000)
start_time <- Sys.time()
# selecting the column on which I have to work upon
cols <- c("ST", "VALP","LAPTOP","HISPEED","FULP","GASP","TYPE","TAXP","WIF","HINCP","GRNTP","RNTP","FINCP","RMSP","NP","ADJINC","ACR","INSP","FTAXP","FHINCP","FINSP","YBL","MV","HHL","VEH")
# using the fread method of the data.table package to read the huge data in less time.
mainData <- fread("MainData.csv",select = cols)
end_time <- Sys.time()
# calculating the time of reading here
end_time - start_time
# getting the fancy summary of data using pander package
pander(summary(mainData))
# getting the summary plus histogram of data using the skim package
skim(mainData)
Code-2 : Getting the insight of data here
library(kableExtra)
library(dplyr)
library(magrittr)
library(DT)
options(DT.options = list(pageLength = 30))
top10.mainData <- mainData[1:10]
top10.mainData %>%
datatable(options = list(dom = "t", ordering = FALSE),
rownames = FALSE,
width = 30) %>%
formatStyle(c("ST", "VALP","LAPTOP","HISPEED","FULP","GASP","TYPE","TAXP","WIF","HINCP","GRNTP","RNTP","FINCP","RMSP","NP","ADJINC","ACR","INSP","FTAXP","FHINCP","FINSP","YBL","MV","HHL","VEH"), backgroundColor = styleEqual(NA, "skyblue"))
Code-3 : Checking the unwanted and null values present in the data
library(dplyr)
library(tidyverse)
library(ggdark)
library(ggplot2)
missing.values <- mainData %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing)
missing.values$id <- seq(1,21)
label_data <- missing.values
number_of_bar <- nrow(label_data)
angle <- 90 - 360 * (label_data$id) /number_of_bar
label_data$hjust<-ifelse( angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)
ggplot(missing.values, aes(x=as.factor(id), y=num.missing)) + # Note that id is a factor. If x is numeric, there is some space between the first bar
# This add the bars with a blue color
geom_bar(stat="identity", fill=alpha("yellow", 1)) +
# Limits of the plot = very important. The negative value controls the size of the inner circle, the positive one is useful to add size over each bar
ylim(-1,120) +
# Custom the theme: no axis title and no cartesian grid
theme_minimal() +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.margin = unit(rep(-1,4), "cm") # Adjust the margin to make in sort labels are not truncated!
) +
# This makes the coordinate polar instead of cartesian.
coord_polar(start = 0) +
# Add the labels, using the label_data dataframe that we have created before
geom_text(data=label_data, aes(x=id, y=num.missing+10, label=key, hjust=hjust), color="White", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) + dark_theme_void() + labs(title = "Circular Bar Plot of missing values") # drop the broken scale_y/scale_x calls: the y limits are already set by ylim() above, and x is a discrete factor
Code-4 : Calculating the Null values and creating the insight
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyverse)
library(ggdark)
missing.values.percentage <- mainData %>%
gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
levels <-
(missing.values.percentage %>% filter(isna == T) %>% arrange(desc(pct)))$key
percentage.plot <- missing.values.percentage %>%
ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna , width=.8),
stat = 'identity', alpha=0.35 , position = position_dodge() , colour="black") +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('purple3', 'gold1'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")
percentage.ploting <- ggplotly(percentage.plot)
percentage.ploting
Code-5 : Calculating and plotting the average property value in USA by states
library(dplyr)
library(ggplot2)
library(ggdark)
valuebystate <- mainData %>% select(ST , VALP)
plotdata <- valuebystate %>% group_by(ST) %>% filter(!any(is.na(ST))) %>% summarise(Avg_Value = mean(VALP,na.rm = TRUE))
statewise.landValue <- valuebystate %>% group_by(ST) %>% filter(!any(is.na(ST))) %>% summarise(Avg_Value = mean(VALP,na.rm = TRUE))
statewise.landValue
tb <- (plotdata$ST)
tbNorm <- (plotdata$Avg_Value)
require ("ggplot2")
require ("choroplethr")
require ("choroplethrMaps")
states <- data.frame ( c (1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51,
53, 54, 55, 56, 72),
c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
"Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
"Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
"Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
"New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
"Washington", "West Virginia", "Wisconsin", "Wyoming", "Puerto Rico"))
names (states) = c ("Index", "State")
# takes a table across states as input and prints it on a map
# uses the "choroplethr" package
plot.map <- function (tb, title = "", legend = "") {
i <- match (names (tb), as.character(states$Index))
states <- data.frame (tb, tolower(states$State[i]))
states <- states [,c (3,2)]
# this is required by choroplethr
names (states) <- c ("region", "value")
states$region <- as.character (states$region)
# this is identical to state_choroplethr except that the labels are being removed
c = StateChoropleth$new(states)
c$set_num_colors(7)
c$title = title
c$legend = legend
c$set_zoom(NULL)
c$show_labels = FALSE
c$render()
}
avg.val.data <- (statewise.landValue$Avg_Value)
state.data <- (plotdata$ST)
avg.data <- state.data / avg.val.data
geographical.rep.table <- as.table(setNames(tbNorm,tb))
plot.map (geographical.rep.table, "Average Property Value By States in US", "Cluster of Land Value") + theme(plot.title = element_text(hjust = 0.5)) + dark_theme_linedraw() + theme(legend.background = element_rect(fill="black",
size=0.5, linetype="solid",
colour ="white"))
Code-6 : Calculating and plotting the laptop users with high-speed internet availability
user.laptop.hispeed <- mainData %>% filter(LAPTOP != 'b' , LAPTOP ==1 , HISPEED !='NA' ) %>% select(LAPTOP , HISPEED)
user.laptop.hispeed.pie <- user.laptop.hispeed %>% group_by(LAPTOP,HISPEED) %>% summarise(COUNT = n()) %>% mutate(lab.ypos = cumsum(COUNT) - 0.5*COUNT)
user.laptop.hispeed.pie$HISPEED <- factor(user.laptop.hispeed.pie$HISPEED , labels = c("Yes" , "No"))
ggplot(user.laptop.hispeed.pie, aes(x = "", y = COUNT , fill = HISPEED)) +
geom_bar(width = 1, stat = "identity", color = "black") +
coord_polar("y", start = 0) +
theme_void() + geom_text(aes(label = paste0(COUNT, " (", scales::percent(COUNT / sum(COUNT)),")")),
position = position_stack(vjust = 0.8) , check_overlap = T , size = 3.5) + labs(title = "Desktop Users With HighSpeed Internet Service" , fill = "Hispeed Connectivity") + theme(legend.text = element_text(face = "italic", colour="steelblue4",family = "Helvetica"),legend.title = element_text(colour = "steelblue", face = "bold.italic", family = "Helvetica")) + theme(
legend.box.background = element_rect(),
legend.box.margin = margin(6, 6, 6, 6)
)
Code-7 : Calculating and plotting the natural resources consumption in united states
library(tidyverse)
# Natural Resources Consumption by States
FuelCost <- mainData %>% select(ST , FULP , GASP) %>% filter(FULP > 2 , GASP >2)
FuelCostPlot <- FuelCost %>% group_by(ST) %>% summarise(Avg_Fuel_Usage = mean(FULP , na.rm = TRUE) , Avg_Gas_Usage = mean(GASP , na.rm = TRUE))
FuelCostPlot$ST <- factor(FuelCostPlot$ST)
levels(FuelCostPlot$ST) <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
"Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
"Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
"Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
"New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
"Washington", "West Virginia", "Wisconsin", "Wyoming", "Puerto Rico")
state.wise.consumption.data <- FuelCostPlot %>%
gather("Stat", "Value", -ST)
ggplot(state.wise.consumption.data, aes(x = ST, y = Value, fill = Stat)) +
geom_col(position = "dodge") + coord_flip() +scale_fill_manual(values=c("#D823DE", "#23DEA0")) + theme(legend.box.background = element_rect(),legend.box.margin = margin(6, 6, 6, 6)) + labs(title = "Natural Resources Consumption by States") + theme(axis.title.x = element_blank(),axis.text.y=element_text(size=rel(0.8))) + dark_theme_linedraw() + xlab("State Names") +ylab("Consumption Values") + theme(legend.background = element_rect(fill="black",
size=0.5, linetype="solid",
colour ="white"))
Code-8 : Calculating and plotting the average house tax by states in the United States
library(mapproj)
library(ggplot2)
library(ggdark)
states_data_map <- map_data("state")
Houses.only <- mainData %>% filter(TYPE == 1 , FTAXP == 1) %>% select(ST , TAXP)
avg.tax <- Houses.only %>% group_by(ST) %>% summarise(avg.mean.tax = mean(TAXP , na.rm = T))
ST<-c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
region<-c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')
states.data <- data.frame(ST,region)
states.data$ST <- as.factor(as.character(states.data$ST))
avg.tax$ST <- as.factor(as.character(avg.tax$ST))
common.data <- merge(avg.tax,states.data,by="ST")
mapcreation <- merge(states_data_map, common.data, by="region")
centroids <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
centroids$abb<-state.abb[match(centroids$region,tolower(state.name))]
ggplot(mapcreation, aes(x = long, y = lat, group = group, fill = avg.mean.tax)) +
coord_map("gilbert") +
theme_light() +
scale_fill_continuous(low="Green",high="red",limits=c(min(mapcreation$avg.mean.tax), max(mapcreation$avg.mean.tax))) +
labs(title = "Average House Tax (Property) Figures - By States",fill="Average\nProperty Tax\n Costs($)") + geom_polygon(colour = "black") + theme(strip.background = element_blank(), strip.text.x = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.line = element_blank(), panel.border= element_blank(), panel.grid = element_blank(), legend.position = "right") + xlab("") + ylab("") + with(centroids, ggplot2::annotate(geom="text", x = long, y=lat, label = abb,
size = 3,color="black",family="Times")) + theme(legend.box.background = element_rect(),legend.box.margin = margin(6, 6, 6, 6)) + theme(plot.title = element_text(hjust = 0.5)) + dark_theme_linedraw() + theme(legend.background = element_rect(fill="black",
size=0.5, linetype="solid",
colour ="white"))
Code-9 : Checking and Plotting the Economic Entities in the United States
# Correlation between all economic-aspect columns, which reflects the basic income and expenditure in a household.
require(ggpubr)
require(tidyverse)
require(Hmisc)
require(corrplot)
library(ellipse)
library(RColorBrewer)
corr.components <- mainData %>% select(WIF , HINCP , GRNTP , RNTP , FINCP )
corr.components <- na.omit(corr.components)
round(cor(corr.components), 2)
rcorr(as.matrix(corr.components))
M<-cor(corr.components)
corrplot(M, method = "ellipse",col=brewer.pal(n=8, name="PuOr"))
corrplot(M, method = "number",col=brewer.pal(n=8, name="PuOr"))
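Beyond the visual plots, the strongest pairwise relationship can be pulled out of the correlation matrix programmatically. A minimal sketch, using the matrix `M` computed above (the diagonal is zeroed out first so self-correlations are ignored):

```r
# copy the correlation matrix and zero the diagonal so that
# the trivial correlation of each variable with itself is skipped
M.offdiag <- M
diag(M.offdiag) <- 0
# locate the largest absolute off-diagonal correlation
strongest <- which(abs(M.offdiag) == max(abs(M.offdiag)), arr.ind = TRUE)[1, ]
# report the variable pair and its correlation coefficient
rownames(M)[strongest["row"]]
colnames(M)[strongest["col"]]
M[strongest["row"], strongest["col"]]
```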
Code-10 : Predict the Value of the House by the Number of Rooms Available
library(moments)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(broom)
library(ggdark)
# Extracting the valuable columns
predict.data <- mainData %>% filter(!is.na(RMSP), !is.na(VALP)) %>% select(RMSP , VALP)
predict.data <- na.omit(predict.data)
# F test comparing the variances of the two variables
var.test(predict.data$RMSP, predict.data$VALP)
# checking the correlation between the variables after the filtering.
cor(predict.data$VALP , predict.data$RMSP)
# Checking the normality
#checking the skewness of RMSP column
skewness(predict.data$RMSP)
#checking the skewness of VALP column
skewness(predict.data$VALP)
ggplot(predict.data, aes(x=RMSP)) +
geom_histogram(aes(y=..density..), colour="darkblue", fill="lightblue")+
geom_density(alpha=.2, fill="#FF6666" , color = 'black') + labs(title="Histogram for Room Number") +
labs(x="Number of Rooms", y="Density Values") + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept=mean(RMSP)),
color="blue", linetype="dashed", size=1)
ggplot(predict.data, aes(x=VALP)) +
geom_histogram(aes(y=..density..), colour="darkblue", fill="red")+
geom_density(alpha=.2, fill="#FF6666",color = 'black') + labs(title="Histogram for Value of House") +
labs(x="Value of House", y="Density Values") + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept=mean(VALP)),
color="red", linetype="dashed", size=1)
# checking the relation by visualization
ggplot(predict.data , aes(x = VALP , y = RMSP)) + geom_point(color = "black") + stat_smooth()
# counting houses per room number (note: 'valueofhouse' here is a count, not a price)
predict.data.test <- predict.data %>% group_by(RMSP) %>% dplyr::summarise(valueofhouse = n())
lm.method.predict <- lm(VALP ~ RMSP, data =
predict.data)
lm.method.predict.test <- lm(valueofhouse ~ RMSP, data =
predict.data.test)
summary(lm.method.predict.test)
lm.method.predict.summary <- summary(lm.method.predict.test)
kable(lm.method.predict.summary$coefficients) %>%
kable_styling("striped", full_width = F) %>%
row_spec(0:2, bold = T, color = "black", background = "skyblue")
# residual standard error expressed as a percentage of the mean house value (prediction error rate)
sigma(lm.method.predict)*100/mean(predict.data$VALP)
qqnorm(lm.method.predict$residuals, col = "yellow")
qqline(lm.method.predict$residuals, col = "blue")
lm.method.predict.test.metrics <- augment(lm.method.predict.test)
ggplot(lm.method.predict.test.metrics, aes(RMSP, valueofhouse)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = RMSP, yend = .fitted), color = "red", size = 0.3) + annotate("rect", xmin=c(3.5,15), xmax=c(9.5,26), ymin=c(250000,0) , ymax=c(750000,150000), alpha=0.2, color="white", fill="blue") + dark_theme_linedraw() + annotate("text", x=11, y=820000, label= "Maximum Residual Value \n (R.S.S value very high)") + annotate("text", x=22, y=220000, label= "Minimum Residual Value \n (R.S.S value Accurate)") + xlab("Room Space") + ylab("Value of the House")
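The fitted room-count model can also be used for point prediction. A minimal sketch, using `lm.method.predict` fitted above (the choice of 6 rooms and of a 95% prediction interval are illustrative assumptions):

```r
# predict the value of a hypothetical 6-room house from the fitted model,
# together with a 95% prediction interval for an individual house
predict(lm.method.predict,
        newdata = data.frame(RMSP = 6),
        interval = "prediction")
```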
Code-11 : Does Household Income Depict the Standard of Living in the United States?
require(ggpubr)
# Correlation between the Income and the House size
income.house.data <- mainData %>% select(HINCP , NP , ADJINC , ACR) %>% filter(!is.na(ACR))
income.house.data <- na.omit(income.house.data)
# adjusting household income to constant dollars (ADJINC carries six implied decimal places, hence the division by 1e6)
income.house.data$AdjustedHousehold_Inc <- income.house.data$HINCP * income.house.data$ADJINC / 1e6
# flagging the outliers (identified here, but not removed from the data)
outliers <- boxplot(income.house.data$AdjustedHousehold_Inc, plot = FALSE)$out
# finding the correlation
plot(income.house.data$ACR,income.house.data$AdjustedHousehold_Inc,col="blue")
income.house.data.df<- income.house.data[,c("AdjustedHousehold_Inc","ACR")]
income.house.data.df1<- income.house.data.df[complete.cases(income.house.data.df),]
cor(income.house.data.df1$AdjustedHousehold_Inc,as.numeric(income.house.data.df1$ACR))
ggscatter(income.house.data.df1, x = "AdjustedHousehold_Inc", y = "ACR",
add = "reg.line",
add.params = list(color = "blue", fill = "lightgray"),
conf.int = TRUE
) + stat_cor(method = "pearson", label.x = 3, label.y = 30)
Code-12 : How Much Americans Care About Their Houses
library(ggdark)
library(viridis)
# how much Americans care about their homes
house.insurance.care <- mainData %>% select(TYPE , FINSP , FHINCP , HINCP , INSP , ST)
# keeping housing units (TYPE = 1) with the fire/hazard/flood insurance flags set to 1, along with household income
house.insurance.cols <- subset(house.insurance.care, TYPE==1 & FINSP==1 & FHINCP==1)
house.data<- house.insurance.cols[,c('ST','INSP','HINCP')]
unique.states <- unique(house.data$ST)
insurance.data <- as.data.frame(tapply(house.data$INSP,house.data$ST,mean))
colnames(insurance.data) <- c("Avg")
dataframe.insu <- as.data.frame(cbind(ST=unique.states,Avg=insurance.data$Avg))
dataframe.insu$flag <- c("Insurance")
income.data = as.data.frame(tapply(house.data$HINCP,house.data$ST,mean))
colnames(income.data) <- c("Avg")
dataframe.inc <- as.data.frame(cbind(ST=unique.states,Avg=income.data$Avg))
dataframe.inc$flag <- c("Income")
dataframe.insu$ST=as.factor(as.character(dataframe.insu$ST))
dataframe.inc$ST=as.factor(as.character(dataframe.inc$ST))
insurance.house.df <- rbind(dataframe.insu,dataframe.inc)
ST<-c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
region<-c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')
housing.states <- data.frame(ST,region)
housing.states$ST <- as.factor(as.character(housing.states$ST))
insurance.house.df$ST=as.factor(as.character(insurance.house.df$ST))
all.housing.insu <- merge(insurance.house.df,housing.states,by="ST")
all.housing.insu.top5 = subset(all.housing.insu, ST ==6 | ST == 11 | ST == 25 |ST == 36 |ST == 53 )
ggplot(data=all.housing.insu, aes(x=region, y=round(Avg), fill=flag)) +
geom_bar(stat="identity", position=position_dodge())+ coord_flip() +
theme(axis.title.x = element_blank(),axis.text.y=element_text(size=rel(0.8))) + scale_fill_manual(values=c("#23DEDB", "#DEDE23")) +theme(legend.box.background = element_rect(),legend.box.margin = margin(6, 6, 6, 6)) + theme(plot.title = element_text(hjust = 0.5)) + dark_theme_linedraw() + theme(legend.background = element_rect(fill="black",
size=0.5, linetype="solid",
colour ="white"))
Code-13 : Value of Houses based On their Construction Year
library(ggplot2)
library(dplyr)
library(ggdark)
options(scipen=10000)
yearbuild.propertyval <- mainData %>% select(YBL , VALP)
yearbuild.propertyval <- na.omit(yearbuild.propertyval)
# recoding YBL codes (1-21) to construction-period labels via a lookup vector
ybl.labels <- c("1939 or earlier", "1940 to 1949", "1950 to 1959", "1960 to 1969",
                "1970 to 1979", "1980 to 1989", "1990 to 1999", "2000 to 2004",
                "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
                "2013", "2014", "2015", "2016", "2017")
yearbuild.propertyval$YBL <- ybl.labels[as.numeric(yearbuild.propertyval$YBL)]
avg.price.yearwise <- yearbuild.propertyval %>% group_by(YBL) %>% summarise(mean.value = mean(VALP , na.rm = T))
ggplot(avg.price.yearwise, aes(x=`YBL`, y=mean.value , label = "")) +
geom_point(stat='identity', fill="black", size=6 , color="Red" , alpha = 0.6) +
geom_segment(aes(y = 0,
x = `YBL`,
yend = mean.value,
xend = `YBL`),
color = "white") +
geom_text(color="white", size=3) +
labs(title="Value of House Based on Construction Year",
subtitle="Does Construction Year Matter?") + coord_flip() + dark_theme_linedraw() + xlab("Year of Build") + ylab("Average Price of House")
Code-14 : Linguistic Demographics versus Length of Residence
library(ggplot2)
library(dplyr)
library(ggdark)
move.data <- mainData %>% select(MV,HHL) %>% filter(!is.na(MV), !is.na(HHL)) %>% na.omit()
# recoding the MV and HHL codes to descriptive labels via lookup vectors
mv.labels <- c("12 months or less", "13 to 23 months", "2 to 4 years", "5 to 9 years",
               "10 to 19 years", "20 to 29 years", "30 years or more")
hhl.labels <- c("English only", "Spanish", "Other Indo-European languages",
                "Asian and Pacific Island languages", "Other language")
move.data$MV <- mv.labels[as.numeric(move.data$MV)]
move.data$HHL <- hhl.labels[as.numeric(move.data$HHL)]
move.data <- move.data %>% group_by(MV, HHL) %>% summarise(count.values = n())
ggplot() + geom_point(data = move.data, aes(x = MV, y = count.values, size = 5, color = HHL, shape = HHL)) + coord_flip() + dark_theme_linedraw() + xlab("Living Here Since..") + ylab("Number of Households") + labs(title = "Length of Residence by Household Language")
Code-15 : Predicting the Value of the House by Family Income Using a Support Vector Machine
library(caret)
library(dplyr)
library(kernlab)
# getting the data
f.income.house.size <- mainData %>% select(FINCP,VALP) %>% filter(!is.na(FINCP), !is.na(VALP)) %>% na.omit()
# restricting to the first 100 rows because of machine limits
f.income.house.size <- head(f.income.house.size,100)
# partitioning the data
intrain <- createDataPartition(y = f.income.house.size$VALP, p= 0.7, list = FALSE)
training <- f.income.house.size[intrain,]
testing <- f.income.house.size[-intrain,]
# converting VALP to a factor so that caret treats this as a classification problem
training[["VALP"]] = factor(training[["VALP"]])
# anyNA() checks for any remaining missing values
anyNA(f.income.house.size)
# summary of the data
summary(f.income.house.size)
# setting up the train control: repeated 10-fold cross-validation, which manages the computational overhead for caret's train() function
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# training the model using a linear-kernel SVM
svm_Linear <- train(VALP ~ FINCP, data = training, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
# printing the fitted model
svm_Linear
# predicting the corresponding values on the test set
test_pred <- predict(svm_Linear, newdata = testing)
test_pred
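Because VALP was converted to a factor, caret treats this as a classification problem, so a quick sanity check is the fraction of exact matches between predicted and observed values. A rough sketch (expect this to be low for a dollar-valued outcome, since only exact matches count):

```r
# fraction of test rows where the predicted house value matches the
# observed value exactly; both sides are compared as character labels
mean(as.character(test_pred) == as.character(testing$VALP))
```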
Code-16 : State Wise - Number of Workers Present in a House (Dynamic Graph)
library(dplyr)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(ggdark)
library(plotly)
state.workers.home <- mainData %>% select(ST, WIF) %>% filter(!is.na(WIF)) %>% na.omit()
state.workers.home <- state.workers.home %>% dplyr::group_by(ST,WIF) %>% dplyr::summarise(count = n())
# Recoding the number-of-workers categories via a lookup vector (WIF is 0-3, hence the +1 index).
wif.labels <- c("Zero Workers", "One Worker", "Two Workers", "Three Workers")
state.workers.home$WIF <- wif.labels[state.workers.home$WIF + 1]
# Changing the state names here.
state.codes <- c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
state.names <- c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')
state.workers.home$ST <- state.names[match(state.workers.home$ST, state.codes)]
p <- ggplot(state.workers.home, aes(fill=WIF, y=count, x=ST)) +
geom_bar(position="stack", stat="identity") +
scale_fill_viridis(discrete = T) +
ggtitle("Statewise - Number of Workers Present in Home") +
theme_ipsum() +
xlab("State Names") + theme(axis.text.x = element_text(angle=90, hjust=1, size = 7))
ploting <- ggplotly(p)
ploting
Code-17 : Finding the Relationship between the Number of Workers in a House and the Vehicles They Own
# relationship between workers in the family and the vehicles they own
library(dplyr)
library(gplots)
library(ggdark)
worker.vehicles <- mainData %>% select(WIF, VEH) %>% filter(!is.na(WIF), !is.na(VEH)) %>% na.omit()
tbltest <- table(worker.vehicles$WIF, worker.vehicles$VEH)
# performing the chi-square test of independence
chisq.test(tbltest)
balloonplot(t(tbltest), main ="Workers in House - Vehicles They Own", xlab ="Vehicles", ylab="No. of Workers",
label = FALSE, show.margins = FALSE)
#create density curve
curve(dchisq(x, df = 18), from = 0, to = 40,
main = 'Chi-Square Distribution (df = 18)',
ylab = 'Density',
lwd = 2)
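To relate the density curve above to the test result, the critical value at the 5% significance level for 18 degrees of freedom can be obtained with `qchisq` (a minimal sketch):

```r
# chi-square critical value at alpha = 0.05 with df = 18;
# a test statistic above this value rejects independence of the two variables
qchisq(p = 0.95, df = 18)   # approximately 28.87
```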
Code-18 : Compare the mean of the Gross Rent to the Standard Rent
library(dplyr)
library(ggpubr)
# filtering the gross rent from the main data, dropping missing values and zeros
gross.rent <- mainData %>% select(GRNTP) %>% filter(!is.na(GRNTP), GRNTP != 0) %>% na.omit()
# summary of the data
summary(gross.rent$GRNTP)
# plotting the histogram plot -to visualize the normality around the mean
ggplot(gross.rent, aes(x=GRNTP)) +
geom_histogram(aes(y=..density..), colour="black", fill="yellow" , alpha = 0.8)+
geom_density(alpha=.2, fill="#FF6666" , color = 'green') + labs(title="Histogram for Gross Rent") +
labs(x="Gross Rent", y="Density Values") + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept=mean(GRNTP)),
color="blue", linetype="dashed", size=1)
# testing the normality here with qqplot
ggqqplot(gross.rent$GRNTP)
# for a large data set we use the Kolmogorov-Smirnov test rather than the Shapiro-Wilk test (which is limited to 5000 observations in R).
ks.test(gross.rent$GRNTP , y = 'pnorm', mean = mean(gross.rent$GRNTP), sd = sd(gross.rent$GRNTP))
# performing the one-sample t-test here
t.test(gross.rent$GRNTP, mu = 1012)
Code-19 : Neural Network - Predicting the Value of the Property by Corresponding relative Predictors
library(neuralnet)
library(dplyr)
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
nndata <- mainData %>% select(INSP,FINCP,FULP,GASP,VALP) %>% filter(!is.na(FINCP), !is.na(INSP), !is.na(FULP), !is.na(GASP)) %>% na.omit()
nndata <- head(nndata,1000)
# Random sampling
samplesizedata = 0.60 * nrow(nndata)
set.seed(80)
indexdatann = sample( seq_len ( nrow ( nndata ) ), size = samplesizedata )
# Create training and test set
datatrain = nndata[ indexdatann, ]
datatest = nndata[ -indexdatann, ]
## Scale data for neural network
maxnn = apply(nndata , 2 , max)
minnn = apply(nndata, 2 , min)
scaled = as.data.frame(scale(nndata, center = minnn, scale = maxnn - minnn))
trainNN = scaled[indexdatann , ]
testNN = scaled[-indexdatann , ]
set.seed(4)
NN = neuralnet(VALP ~ INSP + FINCP + FULP + GASP , trainNN, hidden = 3 , linear.output = T )
plot(NN , col.out = 'blue' , fontsize = 9,col.out.synapse = "red",col.intercept = "blue" , col.entry = 'skyblue',col.entry.synapse = 'red' , rep= "best")
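The network above was trained on min-max scaled data, so its raw predictions lie on the [0, 1] scale. To read them in dollars, they have to be rescaled back. A sketch, assuming `neuralnet::compute()` together with the `maxnn`/`minnn` vectors and `testNN`/`datatest` sets created above:

```r
# predict on the scaled test set
predictNN <- compute(NN, testNN[, c("INSP", "FINCP", "FULP", "GASP")])
# invert the min-max scaling to bring VALP predictions back to dollars
predicted.VALP <- predictNN$net.result * (maxnn["VALP"] - minnn["VALP"]) + minnn["VALP"]
# root mean squared error against the observed values on the original dollar scale
rmse <- sqrt(mean((datatest$VALP - predicted.VALP)^2))
rmse
```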
In order to perform the actual analysis and to create the graphs and tables, I have used a couple of packages that are not pre-installed and need to be installed in most cases.
Please note that all the packages used here can easily be installed with one single command:
install.packages("package name")
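As a convenience, all of them can be installed in one call. A minimal sketch; the package list is gathered from the `library()`/`require()` calls used throughout the code above:

```r
# packages used in this analysis
pkgs <- c("dplyr", "ggplot2", "tidyverse", "ggpubr", "Hmisc", "corrplot",
          "ellipse", "RColorBrewer", "moments", "knitr", "kableExtra", "broom",
          "ggdark", "viridis", "hrbrthemes", "plotly", "caret", "kernlab",
          "gplots", "neuralnet", "devtools")
# install only those not already present on the machine
install.packages(setdiff(pkgs, rownames(installed.packages())))
```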
Following are the packages you need to install:-